feat: implement and register EtcdCluster admission webhooks#384
Conversation
The manager wires a webhook server but never registered any webhook, so malformed EtcdCluster specs were only (partially) caught at reconcile time with no synchronous, actionable feedback at apply time (see etcd-io#380). Add a validating + defaulting webhook for EtcdCluster and register it via SetupWebhookWithManager in cmd/main.go (skippable with ENABLE_WEBHOOKS=false). Validations (each rejection carries a crisp, actionable message): - size: must be >= 1 and odd (etcd quorum); message names the next valid sizes. - version: must be valid semver; on update, must be a supported single-minor, non-downgrade transition (reuses the controller's ordered version table). - tls: provider must be one of {auto, cert-manager}; the matching providerCfg block must be present/complete; auto must not carry cert-manager config. - storageSpec: immutable once set. Defaulting: - tls.provider defaults to "auto" when a TLS block is present without a provider. Generate config/webhook + config/certmanager (cert-manager serving cert) and enable the WEBHOOK/CERTMANAGER kustomize sections, including CA injection. Tested: unit tests assert the exact rejection message text; envtest registers the webhook with a test manager and applies invalid CRs to assert rejection + message; a new test/e2e/webhook_test.go authors live apply-invalid assertions. Refs etcd-io#380 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: xrl The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @xrl. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
…n to changed fields Three correctness fixes to the EtcdCluster admission webhooks: C1: storageSpec immutability used struct `!=` on a type embedding resource.Quantity fields. `==`/`!=` compares the Quantity's cached string and *inf.Dec pointer, not the numeric value, so a re-spelled but numerically identical quantity (e.g. 1Gi vs 1024Mi) falsely tripped "storageSpec is immutable" under failurePolicy: Fail. Replace with apiequality.Semantic.DeepEqual (apimachinery, already a direct dep). Verified empirically: 1Gi == 1024Mi is false under `==` but true under Semantic.DeepEqual; 1Gi vs 2Gi remains unequal so immutability holds. C2: the storage immutability test only set StorageClassName, leaving both Quantities zero-valued, which is exactly the case where the buggy `==` happens to agree. Populate VolumeSizeRequest in withStorage() and add two cases: 1Gi vs 1024Mi must be allowed (C1 regression guard) and 1Gi vs 2Gi must still be rejected. C3: ValidateUpdate re-ran the full create-time spec validation, so with failurePolicy: Fail and no objectSelector a pre-existing even-sized or unknown-version cluster (the CRD historically only enforced size >= 1) was rejected on every spec edit, with no way to remediate. Only re-check the size and version-format invariants when those fields actually change; always re-validate TLS coherence and the transition (immutability + upgrade-path) checks. ValidateCreate is unchanged. Adds a test asserting legacy even-sized clusters accept unrelated edits and can be remediated, while a changed-but-still-even size is still rejected. Also documents that validateUpgradePath checks the declared spec transition only; the controller remains authoritative against the live image. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
…g, and spec validation
Completes the admission-webhook PR by making the generated webhook
configuration explicit and safe for production, verifying the
cert-manager caBundle injection chain renders end to end, and closing
remaining gaps in synchronous spec validation.
Webhook policy (generated via pinned controller-gen markers):
- failurePolicy=fail is now an explicit, documented decision. These
webhooks are correctness-critical for a stateful quorum store, so they
fail closed. The rule scope is exactly {operator.etcd.io/v1alpha1
etcdclusters}, so a webhook outage never blocks core cluster objects;
a documented break-glass objectSelector (label
etcd.operator.etcd.io/skip-webhooks=true) lets an admin bypass
admission for a single CR while the operator is down.
- timeoutSeconds=10 and matchPolicy=Equivalent added with rationale.
cert-manager caBundle injection:
- Added test/config render assertions that run `kustomize build
config/default` and verify the inject-ca-from annotation points at the
serving Certificate, the Certificate's dnsNames resolve to the webhook
Service FQDN (no surviving placeholders), the issuerRef and mounted
secret name line up, and both webhooks carry the hardened policy +
break-glass selector.
Validation/defaulting completeness:
- validateStorageSpec mirrors the controller's reconcile-time storage
invariants (volumeSizeRequest >= 1Mi and required, volumeSizeLimit >=
request, supported accessModes, pvcName required for ReadWriteMany) so
they are caught synchronously at apply time with actionable messages;
also validated when storage is newly added on update.
- commonName is rejected when it exceeds the documented 64-char X.509 CN
ceiling for both auto and cert-manager providers.
- New unit tests assert the exact rejection-message text; new envtest +
compile-only e2e cases cover undersized storage.
Deferred (intentionally not built): a conversion webhook. It is premature
while only v1alpha1 exists; matchPolicy=Equivalent already guards against
a future equivalent group/version slipping past the validator.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…me up The admission-webhook feature wires config/default to ../certmanager: a cert-manager Certificate/Issuer issues the webhook-server-cert secret and manager_webhook_patch.yaml mounts it into the manager pod. The e2e suite's TestMain installed cert-manager only lazily inside an individual feature test that runs AFTER setup, so 'make deploy DEPLOY_MODE=e2e' ran with no cert-manager present: the Certificate/Issuer had no CRDs to apply against, the secret never materialized, the manager pod hung in ContainerCreating, and the DeploymentAvailable wait timed out before any webhook assertion could run. Install cert-manager (InstallCertManager blocks until cert-manager-webhook is Available) as a setup step BEFORE the deploy step. Verified live: without this the manager never becomes Available; with it the controller-manager is Available ~14s after deploy and all webhook assertions pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Xavier Lange <xrlange@gmail.com>
The webhook e2e fed the suite-wide etcdVersion ("v3.6.12", v-prefixed) into
every case. semver.NewVersion("v3.6.12") errors, so the new version webhook
rejects it: the happy-path 'admits valid cluster' assertion failed outright,
and the size/tls/storage reject cases passed for the WRONG reason (the
aggregated NewInvalid also carried a version error, so a substring match no
longer isolated the invariant under test).
Improvements:
- Use a hardcoded valid semver (3.6.1) for every non-version case, matching the
envtest twin, so each assertion proves exactly its one invariant.
- requireWebhookRejection helper: assert apierrors.IsInvalid (a real 422 from
the validating webhook, not a CRD/schema rejection), assert the exact
spec.<field> path, and assert that unrelated invariants are NOT present
(isolation) -- this is the check that kills the 'passes for the wrong reason'
failure mode.
- Tighten message asserts to the envtest twin's exact text: 'Use 1 or 3
instead' for even size, supported values: "auto", "cert-manager" for the
bad provider (catches a garbled NotSupported list a loose Contains pair would
miss).
- Add the break-glass objectSelector proof: a skip-webhooks-labeled even-sized
cluster is admitted (bypass) while the same spec unlabeled is still rejected.
This is the one behavior only a live API server can prove (render test is
static, envtest has no objectSelector) and the most dangerous new knob.
- Prove the positive upgrade path (3.5.1 -> 3.6.1 admitted) after the skip-minor
rejection, so a reject-all-updates regression cannot hide behind the negative
case.
- Use size 1 for the two clusters that are actually created (still odd/valid) to
cut reconcile load on the kind node.
No sleeps/polling added: admission is synchronous, so a Create/Update return IS
the decision. Live: all 7 sub-asserts green; suite wall-clock ~127-231s, the
8 assertions themselves run in ~0.3s.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
…-tagged images
The admission webhook validated spec.version with coreos/go-semver, which
rejects a leading "v". But the default registry (gcr.io/etcd-development/etcd)
publishes ONLY v-prefixed tags, and the controller built the image as
"<registry>:<version>". So with the webhook on:
- "v3.6.1" (the form config/samples actually ships -> v3.5.21, matching
origin/main) was REJECTED by the validator, and
- a bare "3.6.1" that the validator did accept rendered
"gcr.io/etcd-development/etcd:3.6.1", a tag the registry does not publish
-> NotFound / ImagePullBackOff.
The default image was therefore unpullable whenever the webhook was enabled.
Fix, end to end:
- Webhook NORMALIZES spec.version: a leading "v" is stripped before
go-semver parses it, so BOTH "v3.6.1" and "3.6.1" validate, including the
upgrade-path comparison (validateUpgradePath/checkUpgradePath) and the
no-op short-circuit. This keeps the operator's own v-prefixed samples
appliable rather than self-rejecting.
- Controller NORMALIZES the rendered tag to the registry convention
(internal/controller/utils.go: etcdImageTag) by prepending "v" when the
user supplied a bare semver, so a bare spec.version still pulls. An
already-v-prefixed value and an explicit non-semver custom tag are passed
through untouched so non-default registries are not disturbed.
- Controller's reconcile-time upgrade check now compares and parses on the
normalized form, so the v-prefixed live image tag and a bare spec.version
for the same release are not mistaken for a version change (and parse
cleanly through validateEtcdUpgradePath).
Tests: webhook accepts "v3.6.1" and "3.6.1" and a v-prefixed single-minor
upgrade path (skip-minor still rejected); etcdImageTag and an end-to-end
StatefulSet render assert the tag is v-prefixed for the default registry.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
Summary
The manager already constructs a webhook server (
webhook.NewServer(...)incmd/main.go) and the kustomize scaffolding carries[WEBHOOK]/[CERTMANAGER]markers, but no admission webhook was ever implemented or registered forEtcdCluster. The server started and served nothing, so malformed specs were only (partially) caught later during reconciliation, with no synchronous, actionable feedback atkubectl applytime.This PR implements and registers a validating + defaulting admission webhook for
EtcdCluster.Refs #380
What it does
SetupWebhookWithManageris called fromcmd/main.go(a minimal, localized edit, skippable withENABLE_WEBHOOKS=falsefor localmake runwithout serving certs).Validating (
ValidateCreate/ValidateUpdate)Every rejection returns a crisp, actionable message telling the user exactly what is wrong and how to fix it:
spec.size— must be>= 1and odd. An even-sized etcd cluster tolerates no more failures than the next-smaller odd size while being more likely to lose quorum; the message names the two nearest valid sizes.spec.version— must be valid semver. On update, the current→target transition must be a supported single-minor, non-downgrade path, reusing the same ordered version table (etcd/api/v3/version.AllVersions) the controller uses at reconcile time.spec.tls—providermust be one of{auto, cert-manager}; the provider-specific config block must be present/complete (cert-manager requiresissuerName, andissuerKindif set must beIssuer/ClusterIssuer); theautoprovider must not carry cert-manager config.spec.storageSpec— immutable once set (changing/removing the PVC template after the StatefulSet exists cannot be reconciled in place).Defaulting (
Default)spec.tls.providerdefaults toautowhen a TLS block is present without an explicit provider (matches the documented behaviour ofTLSCertificate.Provider).Sample rejection messages
Logging
The webhook logs every admission decision (admit / default / reject) with structured context — object name/namespace, operation, and per-field reason/remediation — so an operator reading the webhook-server logs can reconstruct exactly why any request was accepted or rejected.
Manifests
make manifestsgeneratesconfig/webhook/manifests.yaml; this PR adds the standardconfig/webhook(service + kustomizeconfig) andconfig/certmanager(self-signed Issuer + servingCertificate) scaffolds, themanager_webhook_patch.yaml, and enables the[WEBHOOK]/[CERTMANAGER]sections inconfig/default/kustomization.yamlincluding cert-manager CA injection.kustomize build config/defaultresolves the service name/namespace andcert-manager.io/inject-ca-fromannotations correctly.Testing
api/v1alpha1/etcdcluster_webhook_test.go): asserts the exact rejection-message text for size/version/TLS/upgrade/immutability and the defaulting behaviour.api/v1alpha1/webhook_envtest_test.go): installs the generated webhook config, runs a manager serving the webhook over the envtest-provisioned cert, applies invalidEtcdClusters against a real API server and asserts rejection + message; confirms defaulting and a valid single-minor upgrade are admitted.test/e2e/webhook_test.go, new file): live apply-invalid assertions via the e2e-framework feature harness.go build ./...,go vet ./...,golangci-lint, the unit tests, and the envtest webhook tests all pass; the e2e file compiles (go test -c ./test/e2e/). The e2e file is self-contained (its own helper) to avoid conflicts with other in-flight e2e PRs.🤖 Generated with Claude Code